Customer churn is a serious problem for telecom companies because acquiring new customers is more expensive than keeping existing ones from leaving. Telecom wants to retain existing customers by providing suitable interventions to the customers who are likely to churn, but it needs evidence that this targeted approach is accurate so that it does not waste money on customers who would have stayed anyway. As the business analytics team, we were tasked with addressing the churn problem.
The first step in our analysis was to examine the churn problem across the firm using descriptive analytics. Our analysis showed that churners differ from non-churners on important customer characteristics (number of voice messages recorded, international plan, voice mail plan, day minutes, evening minutes, night minutes, international calls, and number of customer service calls made). We also observed that as many as 14.49% of customers churn.
Second, we built a logistic regression model to assess the predictive accuracy of the targeted approach. The analytics team evaluated the model on important metrics computed on the holdout data set, as well as on the quality and amount of information it provides. The logistic regression model is not good at predicting churners: while it has an accuracy of 85% on the holdout data, its precision and recall amount to only 46% and 16%, respectively. These results indicate that this is not the best model for the churn problem. We recommend that the company work on developing a better predictive model to address churn.
For the next round of statistical analysis, the analytics team will work on fixing the imbalanced classification problem (the churner class is under-represented relative to non-churners). In addition, we will build a classification tree and a neural network in an attempt to improve the predictive accuracy for churners.
Problem Description
Business Opportunity
Retaining existing customers is crucial for every telecom company because acquiring new customers is more expensive than keeping existing ones from leaving. Telecom wants to retain existing customers by providing suitable interventions to the customers who are likely to churn, but it needs evidence that this targeted approach is accurate.
Business Objectives
To address the churning problem at Telecom, the business objectives of this project were to examine the churning problem across the firm and to assess the accuracy of the targeted approach in predicting which customers are likely to churn.
Research Questions
Using a data analytics strategy, the analytics team conducted a study to answer the following questions:
1) How do customers who churned compare to customers who did not churn with respect to location, account length, type of telephone plan, number of calls and voice mail messages, minutes used, the charges incurred for different services, and the number of customer service calls made?
2) To what extent do customers churn?
3) Considering Telecom customer characteristics, is logistic regression an effective model for targeting customers who are likely to churn?
Data Gathering Process
To conduct this study, the analytics team collected data from Kaggle at https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets/data.
The data contained 1 target variable (churners vs non-churners) and 18 input variables (state, account length, international plan, voice mail plan, number of voice messages recorded, total day calls made, total evening calls made, total night calls made, total international calls made, total day minutes used, total evening minutes used, total night minutes used, total international minutes used, total day charge, total evening charge, total night charge, total international charge, and number of customer service calls made).
In our analyses, we excluded the input variables “state” and “area code” because the project had a tight deadline. These variables can be incorporated at a later point for further analysis.
The process of cleaning the data also involved replacing the Yes/No with True/False in the variables “international plan” and “voice mail plan”.
Code
# Packages used throughout the report
library(tidyverse)   # data wrangling and ggplot2
library(janitor)     # clean_names()

# Data import and cleaning
churn_20 <- read.csv("churn-bigml-20.csv") %>% clean_names()
churn_80 <- read.csv("churn-bigml-80.csv") %>% clean_names()

# Data join and column reordering
telecom_churn <- rbind(churn_20, churn_80) %>%
  select(-c(state, area_code)) %>%
  rename(voice_mail_messages = "number_vmail_messages") %>%
  select("account_length", "international_plan", "voice_mail_plan",
         "voice_mail_messages", "total_day_calls", "total_eve_calls",
         "total_night_calls", "total_intl_calls", "total_day_minutes",
         "total_eve_minutes", "total_night_minutes", "total_intl_minutes",
         "total_day_charge", "total_eve_charge", "total_night_charge",
         "total_intl_charge", "customer_service_calls", "churn")

# Data preparation for columns international_plan and voice_mail_plan
telecom_churn <- telecom_churn %>%
  mutate(international_plan = ifelse(international_plan == "Yes", "True", "False")) %>%
  mutate(voice_mail_plan    = ifelse(voice_mail_plan == "Yes", "True", "False"))
Data Analysis Plan
To reach the business objectives of this study, the analytics team used both descriptive analytics and predictive analytics. More specifically, we used descriptive analytics to examine the churning problem across the firm (questions 1 and 2). We then used predictive analytics to build a logistic regression model in order to assess the accuracy of the targeted approach in predicting which customers are likely to churn (question 3).
All analyses were conducted in R.
Data Description
The data used in this study contained 1 target variable (churners vs non-churners), as well as 17 potential predictors of churn:
Table 1

| Input Variable | Definition |
|---|---|
| Account Length | Number of days since the customer registered his/her account |
| International Plan | Customer has an international plan: True/False |
| Voice Mail Plan | Customer has a voice mail plan: True/False |
| Voice Mail Messages | Number of voice mail messages recorded |
| Total Day Calls | Total number of calls made during the day |
| Total Evening Calls | Total number of calls made during the evening |
| Total Night Calls | Total number of calls made during the night |
| Total International Calls | Total number of international calls made |
| Total Day Minutes | Total number of minutes used during the day |
| Total Evening Minutes | Total number of minutes used during the evening |
| Total Night Minutes | Total number of minutes used during the night |
| Total International Minutes | Total number of minutes used on international calls |
| Total Day Charge | Total dollars charged for day minutes used |
| Total Evening Charge | Total dollars charged for evening minutes used |
| Total Night Charge | Total dollars charged for night minutes used |
| Total International Charge | Total dollars charged for international minutes used |
| Customer Service Calls | Number of customer service calls made |
Exploratory Data Analysis
Thanks to the report created with the dataMaid package, we can see that there are no missing values in the data set. The report also shows each variable's class (character, integer, or numeric) and its number of unique values.
Just to describe a few variables, let’s observe “account length”, “churn”, and “total international calls”.
Account length: We can see that the variable type of “account_length” is integer. The data is approximately normally distributed. We can also observe that the values range from 1 to 243, with 101 being the median. The report flagged the values from 208 to 243 as possible outliers.
Churn: We can see that the variable type of “churn” is character (we will change the variable type to logical later on). We can clearly observe that the mode is False (non-churners).
Total international calls: We can see that the variable type of “total_intl_calls” is integer. The data is right-skewed since the right tail is longer than the left tail. We can also observe that the values range from 0 to 20, with 4 being the median.
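The “possible outliers” flagged above can be illustrated with the common 1.5 × IQR rule. A minimal sketch on synthetic account lengths (dataMaid applies its own heuristics, and the numbers below are illustrative):

```r
# Sketch: flagging possible outliers with the 1.5 * IQR rule
# (synthetic account lengths; dataMaid applies its own heuristics)
set.seed(7)
account_length <- round(rnorm(1000, mean = 101, sd = 40))

q     <- quantile(account_length, c(0.25, 0.75))
upper <- unname(q[2] + 1.5 * IQR(account_length))  # upper fence

upper   # values above this fence are flagged as possible outliers
```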
knitr::include_graphics("/Users/denis/OneDrive/Desktop/Programming for BA/NEOMA_DataProgramming-main/NEOMA_DataProgramming-main/dataMaid_telecom_churn.html")
Solution: Descriptive Analytics
Differences Between Churners and Non-Churners
As shown by the visualizations below, we see the most differences between churners and non-churners in the following variables: number of voice messages recorded, international plan, voice mail plan, day minutes, evening minutes, night minutes, international calls, and number of customer service calls made.
churners     <- subset(telecom_churn, churn == "True")
non_churners <- subset(telecom_churn, churn == "False")

ggplot(churners) +
  aes(x = international_plan) +
  geom_bar(fill = "#112446") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(x = "International plan users", y = "Count of users",
       title = "International Plan", subtitle = "Churners") +
  theme_minimal()
Code
ggplot(non_churners) +
  aes(x = international_plan) +
  geom_bar(fill = "#EF562D") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(x = "International plan users", y = "Count of users",
       title = "International Plan", subtitle = "Non-churners") +
  theme_minimal()
Code
ggplot(churners) +
  aes(x = voice_mail_plan) +
  geom_bar(fill = "#112446") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(x = "Voice mail plan users", y = "Count of users",
       title = "Voice Mail Plan", subtitle = "Churners") +
  theme_minimal()
Code
ggplot(non_churners) +
  aes(x = voice_mail_plan) +
  geom_bar(fill = "#EF562D") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(x = "Voice mail plan users", y = "Count of users",
       title = "Voice Mail Plan", subtitle = "Non-churners") +
  theme_minimal()
T-test Analysis
There is also evidence of a statistically significant difference between customers who churn and customers who do not churn with respect to voice_mail_messages, total_intl_calls, total_day_minutes, total_eve_minutes, total_night_minutes, total_intl_minutes, total_day_charge, total_eve_charge, total_night_charge, total_intl_charge, and customer_service_calls, since each p-value is less than α (0.05).
The only variables showing no statistically significant difference between churners and non-churners are account_length, total_day_calls, total_eve_calls, and total_night_calls, since their p-values are greater than α (0.05).
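As an illustration of the Welch two-sample t-tests used here, a minimal, self-contained sketch on synthetic data (group means, standard deviations, and sample sizes below are hypothetical, not taken from telecom_churn):

```r
# Sketch: Welch two-sample t-test on synthetic "day minutes" for two groups
# (all numbers here are hypothetical, for illustration only)
set.seed(42)
churner_minutes     <- rnorm(100, mean = 210, sd = 50)
non_churner_minutes <- rnorm(100, mean = 175, sd = 50)

res <- t.test(churner_minutes, non_churner_minutes)
res$statistic   # t-ratio
res$p.value     # compared against alpha = 0.05
```

A p-value below 0.05 would lead us to conclude the two group means differ, exactly as in the table below.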
# T-test
library(gt)   # gt(), tab_header()
telecom_churn$churn <- as.logical(telecom_churn$churn)

# List of variables to test
variables_to_test <- c("account_length", "voice_mail_messages", "total_day_calls",
                       "total_eve_calls", "total_night_calls", "total_intl_calls",
                       "total_day_minutes", "total_eve_minutes", "total_night_minutes",
                       "total_intl_minutes", "total_day_charge", "total_eve_charge",
                       "total_night_charge", "total_intl_charge", "customer_service_calls")

# Create an empty data frame to store the t-test results
t_test_results <- data.frame(variable = character(),
                             t_ratio  = numeric(),
                             p_value  = numeric(),
                             stringsAsFactors = FALSE)

# Perform a t-test for each variable and append the results
for (variable in variables_to_test) {
  t_test_result  <- t.test(telecom_churn[[variable]] ~ telecom_churn$churn)
  t_test_results <- rbind(t_test_results,
                          data.frame(variable = variable,
                                     t_ratio  = t_test_result$statistic,
                                     p_value  = t_test_result$p.value))
}

t_test_results$p_value <- format(t_test_results$p_value, scientific = FALSE)
t_test_results$p_value <- as.numeric(t_test_results$p_value) %>% round(3)
t_test_results$t_ratio <- as.numeric(t_test_results$t_ratio) %>% round(3)

t_test_results %>% gt() %>% tab_header(title = "T-test")
T-test

| variable | t_ratio | p_value |
|---|---|---|
| account_length | -0.962 | 0.336 |
| voice_mail_messages | 5.821 | 0.000 |
| total_day_calls | -1.002 | 0.317 |
| total_eve_calls | -0.537 | 0.591 |
| total_night_calls | -0.349 | 0.727 |
| total_intl_calls | 2.960 | 0.003 |
| total_day_minutes | -9.685 | 0.000 |
| total_eve_minutes | -5.272 | 0.000 |
| total_night_minutes | -2.171 | 0.030 |
| total_intl_minutes | -3.939 | 0.000 |
| total_day_charge | -9.684 | 0.000 |
| total_eve_charge | -5.272 | 0.000 |
| total_night_charge | -2.171 | 0.030 |
| total_intl_charge | -3.940 | 0.000 |
| customer_service_calls | -8.955 | 0.000 |
Code
library(ggstatsplot)   # ggbetweenstats()

# account length t-test
ggbetweenstats(data = telecom_churn, x = "churn", y = "account_length") +
  ggtitle("T-test account length")
Code
# voice mail messages t-test
ggbetweenstats(data = telecom_churn, x = "churn", y = "voice_mail_messages") +
  ggtitle("T-test voice mail messages")
Code
# calls t-tests
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_day_calls") +
  ggtitle("T-test total day calls")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_eve_calls") +
  ggtitle("T-test total eve calls")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_night_calls") +
  ggtitle("T-test total night calls")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_intl_calls") +
  ggtitle("T-test total intl calls")
Code
# minutes t-tests
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_day_minutes") +
  ggtitle("T-test total day minutes")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_eve_minutes") +
  ggtitle("T-test total eve minutes")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_night_minutes") +
  ggtitle("T-test total night minutes")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_intl_minutes") +
  ggtitle("T-test total intl minutes")
Code
# charges t-tests
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_day_charge") +
  ggtitle("T-test total day charge")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_eve_charge") +
  ggtitle("T-test total eve charge")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_night_charge") +
  ggtitle("T-test total night charge")
Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_intl_charge") +
  ggtitle("T-test total intl charge")
Code
# customer service calls t-test
ggbetweenstats(data = telecom_churn, x = "churn", y = "customer_service_calls") +
  ggtitle("T-test customer service calls")
Percentage of Churners
Based on the descriptive statistics, we also observe that 14.49% (483 of 3,333) of Telecom's customers churned, while 85.51% (2,850 of 3,333) did not.
Table
Code
# Step 1: Counts and percentages of churners vs non-churners
counts      <- table(telecom_churn$churn)
percentages <- prop.table(counts) * 100

# Step 2: Create a data frame
# (percentages is a table, so data.frame() expands it into
#  Percentage.Var1 and Percentage.Freq columns)
result <- data.frame(Category   = c("Non-Churners", "Churners"),
                     Count      = as.numeric(counts),
                     Percentage = round(percentages, 2))
result$Percentage.Freq <- paste0(result$Percentage.Freq, "%")
result <- result %>% select(-Percentage.Var1)

# Step 3: Create gt table
result %>% gt()
| Category | Count | Percentage |
|---|---|---|
| Non-Churners | 2850 | 85.51% |
| Churners | 483 | 14.49% |
Chart
Code
# Percentage.Freq holds strings like "85.51%"; strip the "%" for a numeric y axis
ggplot(result, aes(x = Category,
                   y = as.numeric(sub("%", "", Percentage.Freq)),
                   fill = Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Churners vs Non-Churners",
       x = "Category",
       y = "Percentage") +
  theme_minimal() +
  theme(legend.position = "none")
Data Preparation for Modeling
Since the primary goal of predictive modeling is evaluating the predictive accuracy of the model on new data (the holdout data set), the analytics team divided the data into training and holdout sets. More specifically, we put 75% of the records (2,501 of 3,333) in the training set and the remaining 25% (832 of 3,333) in the holdout set.
Code
library(caret)   # createDataPartition()

# Convert churn to a logical column
telecom_churn$churn <- as.logical(telecom_churn$churn)

# 75/25 train/holdout split, stratified on churn
index         <- createDataPartition(telecom_churn$churn, p = 0.75, list = FALSE)
training_data <- telecom_churn[index, ]
holdout_data  <- telecom_churn[-index, ]
Before fitting the logistic regression model, we ran a correlation analysis on a standardized data set to check whether any highly correlated variables needed to be removed.
We discovered that the following variables are perfectly correlated (100%):
- total_day_minutes and total_day_charge
- total_eve_minutes and total_eve_charge
- total_night_minutes and total_night_charge
- total_intl_minutes and total_intl_charge
For this reason, the analytics team decided to remove total_day_charge, total_eve_charge, total_night_charge, and total_intl_charge from the logistic regression model.
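These perfect correlations are expected: each charge column is a fixed per-minute rate times the corresponding minutes column. A self-contained sketch (the rate below is hypothetical):

```r
# Sketch: a charge defined as rate * minutes is perfectly correlated with minutes
# (synthetic data; 0.17 is a hypothetical flat per-minute rate)
set.seed(1)
total_day_minutes <- runif(50, min = 0, max = 350)
total_day_charge  <- 0.17 * total_day_minutes

cor(total_day_minutes, total_day_charge)   # exactly 1
```

Keeping only one column from each such pair removes redundant information without losing any predictive signal.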
In the eigenvalue analysis of the remaining predictors, the range of values is relatively small, with the largest eigenvalue being approximately 1.0878 and the smallest non-zero eigenvalue being approximately 0.9392.
While there is some variance among the eigenvalues, the difference in magnitude between the largest and smallest is not substantial. Therefore, based solely on these eigenvalues, severe multicollinearity does not appear to be present in the data.
The analytics team also decided to exclude total_day_calls, total_eve_calls, and total_night_calls as predictors of churn, since they are theoretically related to their respective totals of minutes (they capture the same behavior).
Based on the correlation analysis performed, the final list of predictors to be included in the model is the following: account_length, international_plan, total_intl_calls, voice_mail_plan, voice_mail_messages, total_day_minutes, total_eve_minutes, total_night_minutes, total_intl_minutes, customer_service_calls.
Therefore, the logistic regression will be performed utilizing these 10 predictors of churn.
The analytics team decided to utilize the logistic regression model. The logistic regression model allows decision makers to predict the probability of a particular categorical response for a given set of independent variables. Logistic regression models use a function of Y called logit. This logit can be mapped back to a probability which, in turn, can be mapped back to a class (churners vs non-churners).
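The logit-to-probability-to-class mapping can be sketched in a few lines of base R (the coefficient values below are illustrative, not the fitted model's):

```r
# Sketch: mapping a logit to a probability, then to a class
logit      <- -2 + 1.5 * 1                        # illustrative intercept + coefficient * predictor
p          <- plogis(logit)                       # inverse logit: 1 / (1 + exp(-logit))
prediction <- ifelse(p >= 0.5, "Churner", "Non-churner")

p            # about 0.378
prediction   # "Non-churner"
```

The 0.5 cutoff used here is the conventional default; other thresholds trade precision against recall.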
The following output contains the estimated coefficient for each predictor of churn, along with its standard error, z-value, and associated p-value.
The significance codes at the bottom give the p-value thresholds denoted by asterisks. For instance, *** next to a coefficient indicates a p-value below 0.001, a very high level of statistical significance.
The AIC (Akaike Information Criterion) is a measure of the relative quality of the statistical model for a given set of data. The lower the AIC, the better the model.
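As a self-contained illustration of "lower AIC is better", here are two logistic models fit to synthetic data, one with an informative predictor and one with pure noise (all names and values below are hypothetical):

```r
# Sketch: comparing two logistic models by AIC (lower is better)
set.seed(3)
x1 <- rnorm(200)                            # informative predictor
x2 <- rnorm(200)                            # pure noise
y  <- rbinom(200, 1, plogis(-1 + 2 * x1))   # outcome driven by x1 only

m_informative <- glm(y ~ x1, family = binomial)
m_noise       <- glm(y ~ x2, family = binomial)

AIC(m_informative) < AIC(m_noise)   # TRUE: the informative model fits better
```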
Overall, the model is statistically significant, with several predictors showing strong associations with the likelihood of churn.
Code
logistic_model <- glm(training_data$churn ~ ., data = training_data,
                      family = binomial(link = "logit"))
summary(logistic_model)
Call:
glm(formula = training_data$churn ~ ., family = binomial(link = "logit"),
data = training_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.651964 0.629467 -13.745 < 2e-16 ***
account_length 0.002195 0.001607 1.366 0.171831
international_planTrue 1.994215 0.171274 11.643 < 2e-16 ***
voice_mail_planTrue -2.360739 0.662644 -3.563 0.000367 ***
voice_mail_messages 0.045054 0.020680 2.179 0.029357 *
total_intl_calls -0.090857 0.029435 -3.087 0.002024 **
total_day_minutes 0.013013 0.001258 10.345 < 2e-16 ***
total_eve_minutes 0.007496 0.001317 5.692 1.26e-08 ***
total_night_minutes 0.004332 0.001289 3.361 0.000777 ***
total_intl_minutes 0.101902 0.023661 4.307 1.66e-05 ***
customer_service_calls 0.545444 0.045119 12.089 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2071.8 on 2500 degrees of freedom
Residual deviance: 1618.7 on 2490 degrees of freedom
AIC: 1640.7
Number of Fisher Scoring iterations: 6
In this logit equation, p represents the probability of a customer churning, while the predictors represent various characteristics or behaviors of the customer.
Just to describe a few terms of the logit function, let's look at what the intercept and international plan represent:
- Intercept (-8.6520) is the baseline log-odds of a customer churning when all predictor variables are zero.
- international_planTrue (1.9942): having an international plan increases the log-odds of a customer churning by 1.9942 units.
The ROC curve helps analyze classification performance at various threshold settings. A high true positive rate (sensitivity) for a class means that the model classifies that class well. The AUC (area under the curve) ranges from 0.5 (random guessing) to 1 (perfect discrimination between classes); the higher the AUC, the better the model performs.
In our case, the AUC on the holdout data set is 0.842. While this indicates reasonable overall discrimination, it is not strong enough for reliably identifying churners across threshold settings. The analytics team recommends developing a better model to predict churners and provide incentives to them.
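For intuition, the AUC can also be computed directly as the probability that a randomly chosen churner gets a higher predicted score than a randomly chosen non-churner. A self-contained sketch with toy labels and scores (the report's 0.842 comes from the fitted model on the holdout set):

```r
# Sketch: AUC as the rank-based probability that a positive outscores a negative
labels <- c(1, 1, 1, 0, 0, 0, 0)                 # 1 = churner, 0 = non-churner
scores <- c(0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1)   # toy predicted probabilities

pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

auc   # 10 of 12 churner/non-churner pairs are ranked correctly: 0.8333...
```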
# Plot the ROC curve for the holdout data
# (assumes roc_curve was created with pROC::roc on the holdout predictions;
#  legacy.axes = TRUE plots 1 - specificity on the x axis so the curve is not reversed)
plot(roc_curve, legacy.axes = TRUE, col = "blue", lwd = 2,
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = "ROC Curve")

# Add diagonal line for baseline
abline(a = 0, b = 1, col = "black", lty = 2)

# Add legend
legend("bottomright",
       legend = c("ROC Curve", "Baseline (AUC = 0.5)"),
       col = c("blue", "black"), lty = c(1, 2), lwd = c(2, 1))
Conclusion and Recommendations
The logistic regression model is not good at predicting churners. While the model has an accuracy of 85% on the holdout data, its precision and recall amount to only 46% and 16%, respectively (the model predicts non-churners much better than churners, most likely because non-churners greatly outnumber churners). The AUC of 0.84 also leaves room for improvement.
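The reported metrics follow directly from a holdout confusion matrix. The counts below are hypothetical, chosen only to mirror the reported 85% / 46% / 16% figures, not the actual holdout results:

```r
# Sketch: accuracy, precision, and recall from a confusion matrix
# (hypothetical counts chosen to mirror the reported metrics)
TP <- 19;  FN <- 102   # churners:     correctly / incorrectly classified
FP <- 22;  TN <- 689   # non-churners: incorrectly / correctly classified

accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # share of all correct predictions
precision <- TP / (TP + FP)                    # share of predicted churners who churn
recall    <- TP / (TP + FN)                    # share of actual churners caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 2)
```

With so few churners, a model can score high accuracy while catching almost none of them, which is exactly the pattern observed here.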
These results point to the conclusion that this is not the best model to approach the churning problem.
We recommend that the company work on developing a better predictive model to address churn. Specifically, we recommend fixing the imbalanced classification problem (the churner class is under-represented relative to non-churners). We also suggest testing other algorithms, such as a classification tree and a neural network, in an effort to improve the predictive accuracy for churners and the effectiveness of the targeted approach in retaining probable churners.